
[opt](rowset) Aggregate non-MOW segment key bounds to reduce rowset meta size #62604

Merged
liaoxin01 merged 6 commits into apache:master from liaoxin01:feat-non-mow-key-bounds-aggregation
Apr 24, 2026

Conversation

@liaoxin01
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:
For non-MOW (duplicate / aggregate key) tables, per-segment key bounds are not consumed on the read path — only the rowset-level [min, max] is used by the reader and ordered compaction. In cloud mode, persisting bounds for every segment can blow past FDB's value size limit on commit_rowset.

Introduce an `enable_aggregate_non_mow_key_bounds` BE config (default off). When enabled, non-MOW rowsets collapse per-segment bounds into a single [overall_min, overall_max] entry at write time, and compaction preserves this behavior. MOW rowsets always retain per-segment bounds — their `lookup_row_key` path relies on them for delete bitmap computation, and is guarded by a new DCHECK against aggregated input.

A new optional `segments_key_bounds_aggregated` flag is added to both RowsetMetaPB and RowsetMetaCloudPB so consumers can distinguish aggregated from per-segment layouts. Proto round-trip, pb_convert, snapshot restore, and index builder all preserve both this flag and the existing `segments_key_bounds_truncated` flag.

Correctness notes:

  • `first_key`/`last_key` callers (`block_reader`, ordered compaction) already bail out on overlapping rowsets, so for non-overlapping rowsets the aggregated [min, max] equals seg[0].min / seg[last].max exactly.
  • `merge_rowset_meta` (MOW partial-update publish) DCHECKs both sides are non-aggregated.
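The write-time collapse described above can be sketched roughly as follows. This is a simplified illustration only: `KeyBounds` and `aggregate_key_bounds` are hypothetical stand-ins, not the actual Doris `KeyBoundsPB`/`RowsetMeta` API.

```cpp
#include <string>
#include <vector>

// Illustrative stand-in for Doris's per-segment KeyBoundsPB.
struct KeyBounds {
    std::string min_key;
    std::string max_key;
};

// Collapse per-segment bounds into a single [overall_min, overall_max] entry.
// For non-overlapping segments this equals {seg[0].min, seg[last].max}; for
// overlapping ones the element-wise min/max still yields a correct envelope.
std::vector<KeyBounds> aggregate_key_bounds(const std::vector<KeyBounds>& segs) {
    if (segs.size() <= 1) return segs;  // nothing to collapse
    KeyBounds overall = segs.front();
    for (const KeyBounds& kb : segs) {
        if (kb.min_key < overall.min_key) overall.min_key = kb.min_key;
        if (kb.max_key > overall.max_key) overall.max_key = kb.max_key;
    }
    return {overall};
}
```

In the real code path, the writer would additionally set `segments_key_bounds_aggregated = true` on the rowset meta so readers can tell this single-entry layout apart from a genuine one-segment rowset.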

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings April 19, 2026 13:58
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01
Contributor Author

/review

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces an optional optimization to reduce rowset metadata size for non-MOW (duplicate/aggregate-key) tables by aggregating per-segment key bounds into a single rowset-level [min, max] entry, primarily targeting cloud mode FDB value size limits.

Changes:

  • Add enable_aggregate_non_mow_key_bounds BE config (default off) and use it in rowset write + ordered compaction to optionally aggregate non-MOW segments_key_bounds.
  • Add segments_key_bounds_aggregated to RowsetMeta protobufs and propagate it through pb_convert, snapshot restore, index builder, and rowset meta helpers.
  • Add BE unit tests and a regression suite validating aggregated/non-aggregated behavior for non-MOW vs MOW tables.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Summary per file:

  • regression-test/suites/data_model_p0/duplicate/test_non_mow_key_bounds_aggregation.groovy: Regression coverage for config on/off and MOW vs non-MOW layout expectations.
  • gensrc/proto/olap_file.proto: Adds segments_key_bounds_aggregated flag to RowsetMetaPB/RowsetMetaCloudPB.
  • be/test/storage/rowset/rowset_meta_test.cpp: Unit tests for aggregation behavior and truncation interaction.
  • be/src/storage/task/index_builder.cpp: Preserves aggregated/truncated flags when rebuilding rowset meta for index tasks.
  • be/src/storage/tablet/base_tablet.cpp: Adds DCHECK to ensure MOW lookup path never sees aggregated bounds.
  • be/src/storage/rowset/rowset_meta.h: Adds aggregated flag accessors and extends set_segments_key_bounds API.
  • be/src/storage/rowset/rowset_meta.cpp: Implements aggregation in set_segments_key_bounds; adds DCHECK in merge_rowset_meta.
  • be/src/storage/rowset/rowset.h: Exposes is_segments_key_bounds_aggregated() on Rowset.
  • be/src/storage/rowset/beta_rowset_writer.cpp: Aggregates non-MOW key bounds at write time based on the new config.
  • be/src/storage/compaction/compaction.cpp: Aggregates non-MOW key bounds for ordered compaction output based on the new config.
  • be/src/common/config.h / be/src/common/config.cpp: Declares/defines enable_aggregate_non_mow_key_bounds.
  • be/src/cloud/pb_convert.cpp: Propagates the aggregated flag between Doris and cloud rowset meta PBs.
  • be/src/cloud/cloud_snapshot_mgr.cpp: Preserves aggregated/truncated flags during snapshot rowset meta creation.


Comment thread be/src/storage/compaction/compaction.cpp Outdated
Comment thread be/src/storage/rowset/rowset_meta.cpp Outdated
Comment thread be/src/storage/tablet/base_tablet.cpp Outdated
@liaoxin01
Contributor Author

run buildall

@liaoxin01 liaoxin01 force-pushed the feat-non-mow-key-bounds-aggregation branch from 1b267ba to 89e4675 Compare April 19, 2026 14:50
@liaoxin01
Contributor Author

run buildall

Comment thread be/src/cloud/pb_convert.cpp Outdated
@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24630798753

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 🎉
Increment coverage report
Complete coverage report

@liaoxin01
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 80.90% (72/89) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.22% (20329/38200)
Line Coverage 36.77% (191370/520432)
Region Coverage 33.06% (148692/449708)
Branch Coverage 34.18% (65038/190288)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 82.02% (73/89) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.84% (27627/37413)
Line Coverage 57.58% (298775/518852)
Region Coverage 54.83% (248849/453891)
Branch Coverage 56.36% (107575/190878)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/22) 🎉
Increment coverage report
Complete coverage report

gavinchou
gavinchou previously approved these changes Apr 20, 2026
Comment thread be/src/cloud/pb_convert.cpp Outdated
Comment thread be/src/cloud/pb_convert.cpp Outdated
Comment thread be/src/cloud/pb_convert.cpp Outdated
@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 20, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.41% (78/97) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.78% (27600/37407)
Line Coverage 57.47% (298203/518912)
Region Coverage 54.71% (248342/453933)
Branch Coverage 56.22% (107368/190971)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/1) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.41% (78/97) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.78% (27598/37407)
Line Coverage 57.46% (298171/518912)
Region Coverage 54.68% (248202/453933)
Branch Coverage 56.21% (107339/190971)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/1) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 79.38% (77/97) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.30% (20358/38195)
Line Coverage 36.86% (191864/520513)
Region Coverage 33.18% (149223/449777)
Branch Coverage 34.28% (65267/190398)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.41% (78/97) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.88% (27636/37407)
Line Coverage 57.60% (298912/518912)
Region Coverage 54.81% (248797/453933)
Branch Coverage 56.39% (107679/190971)

@liaoxin01
Contributor Author

/review

Contributor

@github-actions github-actions Bot left a comment


I found one blocking correctness issue.

1. The new aggregated key-bounds flag is not actually preserved in the index rewrite path: the copied flag is immediately overwritten by the following `set_segments_key_bounds` call, which recomputes the flag as false even when the key bounds already contain the single aggregated entry. Rebuilding an aggregated non-MOW rowset via ADD/DROP INDEX therefore emits metadata with a single key-bounds entry but with the aggregated flag unset.

Checkpoint summary:
- Goal: The PR is meant to aggregate non-MOW key bounds and preserve the new layout/flag across write, compaction, cloud-conversion, restore, and index-rewrite paths. The current head still misses the index-rewrite path, so the goal is not fully met.
- Tests: Unit coverage for the aggregation behavior and regression coverage for write-time aggregation look good. I did not find coverage for ADD/DROP INDEX on an already aggregated non-MOW rowset, which is why this regression slips through.
- Concurrency/lifecycle: I did not find a new concurrency or lifecycle issue in the touched paths.
- Config: The new mutable BE config is observed in the write and ordered-compaction paths. Note that the code currently defaults it to true, which is broader than the PR text's "default off".
- Compatibility/persistence: The proto fields and cloud/restore conversion paths are preserved correctly in the current head.
- Performance: The intended metadata-size reduction looks fine where the aggregated flag is preserved.

Comment thread be/src/storage/task/index_builder.cpp Outdated
@github-actions
Contributor

Correction to my earlier review text: the blocking issue is in be/src/storage/task/index_builder.cpp. rowset_meta->set_segments_key_bounds_aggregated(input_rowset_meta->is_segments_key_bounds_aggregated()) is immediately overwritten by the next rowset_meta->set_segments_key_bounds(key_bounds) call. When the input rowset is already aggregated, key_bounds contains the single aggregated entry, but set_segments_key_bounds(key_bounds) runs with aggregate_into_single=false, so the rebuilt rowset loses the aggregated flag while still keeping num_segments > 1 and segments_key_bounds.size() == 1. Ordered compaction relies on the aggregated flag to preserve that layout when the config is off, so ADD/DROP INDEX can emit inconsistent rowset metadata here. Please pass the source flag into set_segments_key_bounds(key_bounds, input_rowset_meta->is_segments_key_bounds_aggregated()) or remove the standalone setter, and add a test that rebuilds an already-aggregated non-MOW rowset through the index path.
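The setter interaction flagged above can be modeled in a few lines. This is a deliberately simplified sketch: `RowsetMetaSketch` and its members are hypothetical stand-ins for Doris's actual `RowsetMeta` API, shown only to make the overwrite ordering concrete.

```cpp
#include <string>
#include <vector>

struct KeyBounds { std::string min_key, max_key; };

// Hypothetical model of the two setters involved in the bug report.
struct RowsetMetaSketch {
    std::vector<KeyBounds> segments_key_bounds;
    bool aggregated = false;

    void set_segments_key_bounds_aggregated(bool v) { aggregated = v; }

    // This overload recomputes the flag: unless the caller asks for
    // aggregation, it resets `aggregated` to false, which is what silently
    // clobbers the flag copied from the input rowset in index_builder.cpp.
    void set_segments_key_bounds(const std::vector<KeyBounds>& kb,
                                 bool aggregate_into_single = false) {
        segments_key_bounds = kb;  // (real code would also collapse when asked)
        aggregated = aggregate_into_single;
    }
};

// Buggy order: copy the flag first, then lose it to the second setter.
bool rebuild_buggy(const RowsetMetaSketch& in, RowsetMetaSketch& out) {
    out.set_segments_key_bounds_aggregated(in.aggregated);
    out.set_segments_key_bounds(in.segments_key_bounds);  // resets flag to false
    return out.aggregated;
}

// Fix suggested in the review: thread the source flag through the setter.
bool rebuild_fixed(const RowsetMetaSketch& in, RowsetMetaSketch& out) {
    out.set_segments_key_bounds(in.segments_key_bounds, in.aggregated);
    return out.aggregated;
}
```

With an already-aggregated input, `rebuild_buggy` ends with the flag false while the bounds still hold the single aggregated entry; `rebuild_fixed` keeps flag and layout consistent.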

@liaoxin01
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.06% (1843/2361)
Line Coverage 64.74% (32984/50947)
Region Coverage 65.23% (16362/25085)
Branch Coverage 55.80% (8734/15652)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/63) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.41% (78/97) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.42% (27467/37410)
Line Coverage 57.22% (296995/519023)
Region Coverage 54.29% (246579/454158)
Branch Coverage 55.92% (106832/191028)

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 24, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@liaoxin01 liaoxin01 merged commit 618ebd7 into apache:master Apr 24, 2026
32 of 34 checks passed
@liaoxin01 liaoxin01 deleted the feat-non-mow-key-bounds-aggregation branch April 24, 2026 07:52
github-actions Bot pushed a commit that referenced this pull request Apr 24, 2026
github-actions Bot pushed a commit that referenced this pull request Apr 24, 2026

Labels

approved Indicates a PR has been approved by one committer. dev/3.1.x dev/4.0.x dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants